1 R environment set-up

1.1 Loading packages

I first loaded required packages for data munging, visualization, and analysis (these are largely Hadley Wickham libraries, plus some Bioconductor tools).

rm(list = ls())
cleanup <- gc(verbose = FALSE)

# Load libraries I'll need here
library(edgeR)
library(limma)
library(biomaRt)
library(ggbiplot)
library(cowplot)
library(sva)

# Load my go-to libraries
library(dplyr)
library(ggplot2)
library(ggthemes)
library(stringr)
library(readr)
library(readxl)
library(reshape2)

# Packages for R markdown stuff
library(knitr)
library(shiny)

 

1.2 Defining functions

Functions for plotting metrics are contained in metric_qc_functions.R.

source("R/metric_qc_functions.R")

This is a function written by Elizabeth Whalen (shared by Michael Mason) that might come in handy with some steps of the analysis. I modified the function slightly, such that library sizes are updated and normalization factors are calculated after filtering genes. I also added the option to input gene annotation information.

# Function to build DGEList object, filter genes by keeping only those having % 
# samples with at least N counts, and computes normalization from library sizes
setUpDGEList <- function(countData, geneData = NULL, 
                         filterCount = 1, filterPercentage = 0.1)
{
    d <- DGEList(counts = countData, genes = geneData)
    # d <- calcNormFactors(d) # moved further down
    
    # Filter all genes that do not have at least 'filterCount' counts per 
    # million in at least 'filterPercentage' percent of libraries
    keepRows <- rowSums(round(cpm(d$counts)) >= filterCount) >= 
        filterPercentage*ncol(countData)
    print(table(keepRows))

    curDGE <- d[keepRows, ]
    
    # James: I've added this change so that library sizes and normalization
    # factors will always be updated/calculated after filtering genes
    
    # reset library sizes
    curDGE$samples$lib.size <- colSums(curDGE$counts)
    
    # calculate normalization factors (effective library size = 
    # lib.size * norm.factor)
    curDGE <- calcNormFactors(curDGE)
    return(curDGE)
}

 

1.3 Loading data

Next, I read counts and metrics data for the project into R, along with sample annotation for project libraries.

# Read CSV file with read counts
countFile <- "data/HMMF2ADXX_combined_counts.csv"
countDat <- read_csv(countFile) # 37991 obs. of  18 variables
# str(countDat)

# Read CSV file with RNAseq/alignment metrics
metricFile <- "data/HMMF2ADXX_combined_metrics.csv"
metricDat <- read_csv(metricFile) # 16 obs. of  71 variables
# str(metricDat)

# Read XLSX file with sample annotation
designFile <- "data/JMD119 Sample Information .xlsx"
designDat <- read_excel(designFile, skip = 1) # 36 obs. of 18 variables
# str(designDat)

 

1.4 Cleaning data

I needed to do a bit of cleaning/formatting with variable names (column headers) to make life easier and avoid breaking downstream functions.

# Separate gene counts and gene symbols into separate objects, reformat
# variable names in countDat to only include libID
geneDat <- data_frame(ensemblID = countDat$geneName)
countDat <- countDat %>% 
    select(-geneName)
names(countDat) <- names(countDat) %>% 
    str_extract("lib[0-9]+")

# Reformat variable names in metrics data frame
names(metricDat) <- names(metricDat) %>% 
    str_to_lower() %>%  # change variable names to lower case
    make.unique(sep = "_") # de-dup variable names
names(metricDat)[1] <- "lib_id" # reformat libID variable name

# Reformat row names in metrics dataframe
metricDat <- metricDat %>% 
    mutate(lib_id = str_extract(lib_id, "lib[0-9]+"))

# Reformat variable names in design data frame
names(designDat) <- names(designDat) %>% 
    str_replace_all(" +", "_") %>% # replace spaces with underscores
    str_replace_all("#", "num") %>%  # replace # with 'num'
    str_replace_all("/", "_per_") %>% 
    str_replace_all("(\\(|\\))", "") %>% # remove parentheses
    str_to_lower() %>% # change to lower case
    str_replace("(?<=(lib))[a-z]+", "") %>% # replace 'library' with 'lib'
    make.unique(sep = "_") # de-dup variable names

# Remove empty rows from design data frame
designDat <- designDat %>% 
    filter(!is.na(lib_id))

 

I created a new object to store the salient information about groups in the study I want to compare.

groupDat <- designDat %>% 
    # extract knockout status (WT or BCAP) and HSC population (long or short'
    # term); combine into a single group vector
    mutate(koStatus = as.factor(tolower(str_extract(sample_name, "WT|BCAP"))),
           hscPop = as.factor(tolower(str_extract(hsc_population, 
                                                  "Long|Short"))),
           group = as.factor(str_c(koStatus, hscPop, sep = "_"))) %>% 
    select(libID = lib_id,
           koStatus, hscPop, group)

For reference, here are the relevant groups in the data (stored in groupDat):

libID koStatus hscPop group
lib7418 wt long wt_long
lib7419 wt long wt_long
lib7420 wt long wt_long
lib7421 wt long wt_long
lib7422 bcap long bcap_long
lib7423 bcap long bcap_long
lib7424 bcap long bcap_long
lib7425 bcap long bcap_long
lib7426 wt short wt_short
lib7427 wt short wt_short
lib7428 wt short wt_short
lib7429 wt short wt_short
lib7430 bcap short bcap_short
lib7431 bcap short bcap_short
lib7432 bcap short bcap_short
lib7433 bcap short bcap_short

2 Inspecting data

2.1 Metric plotting & QC

Next, I looked at a few standard metrics to see whether any libraries should be excluded due to quality reasons.

# Pull out and format the subset of metrics to plot
metricSummary <- metricDat %>% 
    mutate(percentDuplication = unpaired_read_duplicates / 
               unpaired_reads_examined) %>% 
    select(libID = lib_id, 
           medianCVcoverage = median_cv_coverage, 
           fastqTotalReads = fastq_total_reads, 
           percentAligned = mapped_reads_w_dups,
           percentDuplication)

Cutoffs are set by default to standard values used in the Bioinformatics Core for three metrics; libraries are considered to have ‘failed’ QC for the following conditions:

  • < 1 million total FASTQ reads
  • < 80% aligned reads
  • > 1.0 median CV coverage

I can also adjust slider bars to look at different QC cutoffs (red lines) for the x- and y-axis; dashed lines indicate outlier limits (1.5*IQR).

 

2.1.1 Percent aligned

Percent aligned is simply the number of FASTQ reads for which there is a corresponding alignment in the TopHat BAM file. In other words, percentAligned = # of aligned reads (+ all their duplicate reads, which were removed from the final BAM) / fastqTotalReads.

Percent aligned:
Percentage of aligned reads is well above the 80% cutoff for all libraries, with rates in the mid-90s across the board. lib7422 is outside the nominal outlier range, but still has 91.41% alignment.

Total reads:
While all libraries had well over 10 million reads in the input FASTQ file (after adapter trimming), lib7418 appears to be quite a bit smaller than average the average of 24.27 million reads, with only 13.04 million reads.

 

2.1.2 Median CV coverage

Median CV coverage is calculated by Picard by

  1. calculating the coeficient of variation (CV) in read coverage along the length of each of the 1000 most highly expressed transcripts;
  2. calculating the median CV across these 1000 transcripts.

A high CV of read coverage suggests possible 5’ or 3’ bias, or potentially non-uniform (“bumpy” or “spikey”) coverage of a transcript. If medianCVcoverage is high (> 1), this could indicate a more systemic problem with coverage in the dataset.

Median CV coverage:
All libraries look good (with medianCVcoverage generally close to 0.5) in terms of gene coverage among the top 1000 transcripts.

 

2.2 Examining read count data

I plotted the (smoothed) frequency of log-normalized counts in each library, just to get a sense of the distribution.

Without any filtering, there’s a large spike at -1, representing genes with a count of 0, along with a smaller bump between 0 and 1, representing genes with very small counts. Notably, the distribution of counts for lib7418 is shifted dramatically to the left (which makes sense, given the much smaller number of pre-aligned reads).

 

2.2.1 Gene filtering

I used the function defined above to build the DGEList object for the data, which is the input for downstream functions.

# Filter genes with (cpm > 10) in < 10% of samples
dge = setUpDGEList(countData = countDat, geneData = geneDat,
                      filterCount = 10, 
                      filterPercentage = 0.20)
## keepRows
## FALSE  TRUE 
## 26239 11752

Keeping only those genes with >= 10 counts per million in at least 20% ( 3.2 samples), we’re left with 11752 genes.

While there are still a small fraction of genes with zero or very few counts, these no longer account for such a large percentage of genes across all libraries.

Correcting for library size, the distribution of counts per million (CPM) is more similar across all libraries, including lib7418.

 

2.2.2 Checking code edits

To verify the effect of the change I made to Elizabeth’s code above, I plotted norm.factors and effective library size (lib.size.eff) under two scenarios:

  • preFilter: norm.factors are calculated before genes are filtered
  • postFilter: norm.factors are calculated after genes are filtered and library sizes are updated

For the not-too-stringent threshold used to filter genes (CPM > 10 in 20% of samples), the order of operations for calculating norm.factors appears to have minimal impact on effective library sizes.

 

2.2.3 Gene annotation

I used the biomaRt package to add gene symbols (from MGI) corresponding to gene IDs from Ensembl.

# Get MGI gene symbols corresponding to Ensembl Gene IDs
dgeGeneDat <- dge$genes
mart <- useMart(biomart = "ensembl", dataset = "mmusculus_gene_ensembl")
ens2Gene <- getBM(attributes = c("ensembl_gene_id", "mgi_symbol"), 
                  filters = "ensembl_gene_id", 
                  values = dgeGeneDat$ensemblID, mart = mart)
ens2Gene = ens2Gene[match(dgeGeneDat$ensemblID, ens2Gene$ensembl_gene_id), ]

# Insert MGI gene symbols for genes in DGE object gene info
dgeGeneDat <- dgeGeneDat %>% 
    mutate(mgiSymbol = ens2Gene$mgi_symbol,
           mgiSymbol = ifelse(is.na(mgiSymbol), "NA", mgiSymbol))

dge$genes <- dgeGeneDat

 

2.2.4 Principal component analysis (PCA)

I performed PCA (on log-normalized CPM values) with the prcomp function and plotted the first two principal components using the ggbiplot function from the ggbiplot package.

The second principal component (PC2) appears to explain the difference between long- and short-term HSC populations – this is clearly a strong singal in the complete dataset. PC1, on the other hand, is not obviously correlated with any experimental group or any of the metrics I examined above.

Among the first two principal components, there is no clear clustering by KO status.

Looking at some of the other top principal components, PC4 appears to have the strongest (though not perfect) relationship with KO status. When PC4 is plotted against PC2 (related to HSC population), the samples separate reasonably well into the expected groups. PC3 seems to have almost an inverse relationship with PC2.

 

2.2.5 Possible sources of confounding

To try to get a better sense of what PC1 might represent in the data, I plotted several experimental, sequencing, and alignment variables against PC1 values.

While PC1 does appear to show some relationship with KO status, the two extreme libraries on the PC1 axis (lib7418 and lib7427) both also have very low RNA concentrations (ng_per_ul). None of the other variables seem to be particularly associated with PC1.

 

I also plotted the same set of variables against koStatus.

None of the variables jump out to me as being obviously correlated with koStatus, with the exception of mice age_in_weeks. Because KO mice are all at least 1/2 week older than WT mice, I don’t believe there’s any way to effectively rule out this variable as being a source of expression changes. However, I believe the age difference is small enough that the effects are relatively small.

Despite the slope of some of the fitted lines, none of the RNA-seq quality metrics appear to be particularly skewed by koStatus.


3 Differential expression analysis

3.1 Building models

Design 1: expression ~ koStatus

The first, possibly most naive, model I trained was predicting expression as a function of BCAP knockout status (i.e., koStatus).

# Define the design matrix, only including terms corresponding to KO status;
# use voom to calculate transformed expression values
koDesign <- model.matrix(~ koStatus, data = groupDat)
koVoom <- voomWithQualityWeights(dge, design = koDesign,
                                    plot = TRUE)

Using voomWithQualityWeights from the limma package, several libraries are downweighted in the normalized data object. The two libraries with the lowest sample weights are once again those with low ng_per_ul values (and also with extreme PC1 values).

# Fit model for the group design
koFit <- lmFit(koVoom, koDesign)
koFit <- eBayes(koFit)
koResults <- topTable(koFit, number = nrow(dge))
This model produces 10 significant genes:
ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
ENSMUSG00000025035 Arl3 1.33 5.5 8.1 0 0.003 6.4
ENSMUSG00000092250 Gm20467 -0.78 5.8 -6.4 0 0.021 4.1
ENSMUSG00000040105 Ppapdc2 -1.69 5.1 -6.8 0 0.016 3.9
ENSMUSG00000026104 Stat1 -0.45 7.8 -5.7 0 0.043 2.7
ENSMUSG00000025793 Hgs 1.07 6.6 5.6 0 0.043 2.7
ENSMUSG00000038418 Egr1 0.67 7.9 5.7 0 0.043 2.7
ENSMUSG00000022412 Mief1 0.71 6.7 5.5 0 0.046 2.5
ENSMUSG00000003545 Fosb 1.43 5.4 5.6 0 0.043 2.4
ENSMUSG00000034189 Hsdl1 0.62 7.5 5.4 0 0.050 2.2
ENSMUSG00000047181 Samd14 1.18 4.8 5.6 0 0.043 2.1

logFC in this table represents the log fold-change in expression per unit change in koStatus (in other words, the log FC in wild-type relative to KO).

 

3.1.1 A note about BCAP

Interestingly, the gene for BCAP (i.e., the knock-out target), Pik3ap1, is not significant after multiple testing correction:
ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
ENSMUSG00000025017 Pik3ap1 1.1 5.9 2.6 0.019 0.34 -3.3

Based on previous inspection of the aligned data in IGV, the knockout of BCAP does not manifest entirely cleanly on the RNA level. As seen in the plots below, BCAP has at least low counts in long-term HSCs and moderate counts in short-term HSCs, even in libraries from KO mice.

 

Design 2: expression ~ koStatus + hscPop

Because HSC population (i.e., hscPop: long- vs. short-term) accounts for such a large fraction of variance in the data, it makes sense to control for this variable by including it as a parameter in the model. We lose a degree of freedom in the process, but retain more DoF than if we were to subset the data and focus only on long- or short-term mice.

# Define the design matrix, including terms corresponding to KO status and HSC
# population; use voom to calculate transformed expression values
koHscDesign <- model.matrix(~ koStatus + hscPop, data = groupDat)
koHscVoom <- voomWithQualityWeights(dge, design = koHscDesign,
                                    plot = TRUE)

# Fit model for the group design
koHscFit <- lmFit(koHscVoom, koHscDesign)
koHscFit <- eBayes(koHscFit)
koHscResults <- topTable(koHscFit, coef = 2, number = nrow(dge))

While the model includes parameters for both koStatus and hscPop, we’re most interested in the effect of koStatus on expression. By examining the results for the 2nd model coefficient (1st coefficient is the intercept), we can interpret logFC in the table below as the log FC in expression in WT relative to KO mice, when all other parameters are held constant.

This model produces 14 significant genes for koStatus:
ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
ENSMUSG00000025035 Arl3 1.32 5.5 8.4 0 0.002 7.2
ENSMUSG00000064215 Ifi27 -0.64 7.3 -7.3 0 0.007 5.5
ENSMUSG00000040105 Ppapdc2 -1.72 5.1 -6.7 0 0.015 4.3
ENSMUSG00000092250 Gm20467 -0.77 5.8 -6.5 0 0.016 4.2
ENSMUSG00000026104 Stat1 -0.45 7.8 -6.4 0 0.017 3.7
ENSMUSG00000003545 Fosb 1.38 5.4 6.0 0 0.028 3.2
ENSMUSG00000032690 Oas2 -1.28 5.8 -5.8 0 0.028 3.0
ENSMUSG00000047181 Samd14 1.22 4.8 5.9 0 0.028 2.9
ENSMUSG00000022412 Mief1 0.72 6.7 5.8 0 0.028 2.9
ENSMUSG00000034189 Hsdl1 0.61 7.5 5.7 0 0.034 2.4
ENSMUSG00000025793 Hgs 1.08 6.6 5.6 0 0.038 2.4
ENSMUSG00000039738 Slx4 -0.97 5.4 -5.4 0 0.038 2.2
ENSMUSG00000066877 Nck2 1.21 4.8 5.4 0 0.038 2.1
ENSMUSG00000038418 Egr1 0.66 7.9 5.5 0 0.038 1.9

 

Pik3ap1 is still not significant after multiple testing correction:
ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
ENSMUSG00000025017 Pik3ap1 0.97 5.9 4.4 0 0.087 0.14

 

3.1.2 Surrogate variable analysis (SVA)

To dig a bit further into PC1 / ng_per_ul, which might be introducing unwanted noise into the model, I identified surrogate variables using the sva package. Surrogate variables represent sources of variance in the data, not accounted for by the primary design variables (i.e., the factors of interest); SVA is commonly used to identify and remove potential confounding variables when there may not be clear “batches” among samples.

SVAs were computed on the normalized data with the model matrix for the ~ koStatus + hscPop design

As shown in the plots below, PC1, SV1, and also the voom sample quality weights all appear to be correlated with RNA concentration (ng_per_ul).

koSva <- sva(koHscVoom$E, koHscVoom$design)
## Number of significant surrogate variables is:  3 
## Iteration (out of 5 ):1  2  3  4  5

 

Design 3: expression ~ koStatus + hscPop + ng_per_ul

To control for RNA concentration, I added the ng_per_ul as a parameter in the model.

# Define the design matrix, including terms corresponding to KO status, HSC
# population, and RNA concentration; use voom to calculate transformed 
# expression values
koHscConcDesign <- model.matrix(~ koStatus + hscPop + ng_per_ul, 
                                data = confoundDat)
koHscConcVoom <- voomWithQualityWeights(dge, design = koHscDesign,
                                    plot = TRUE)

Notably, voom still down-weights the libraries with low ng_per_ul, suggesting that the “poor” quality in these samples is not fully explained by the RNA concentration variable.

# Fit model for the group design
koHscConcFit <- lmFit(koHscConcVoom, koHscConcDesign)
koHscConcFit <- eBayes(koHscConcFit)
koHscConcResults <- topTable(koHscConcFit, coef = 2, number = nrow(dge))
This new model produces 48 significant genes for koStatus:
ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
ENSMUSG00000025035 Arl3 1.39 5.5 9.3 0 0.001 7.62
ENSMUSG00000029287 Tgfbr3 0.69 7.0 7.5 0 0.008 5.62
ENSMUSG00000064215 Ifi27 -0.65 7.3 -7.0 0 0.011 4.73
ENSMUSG00000032690 Oas2 -1.37 5.8 -6.9 0 0.011 4.58
ENSMUSG00000034189 Hsdl1 0.66 7.5 6.7 0 0.012 4.24
ENSMUSG00000002413 Braf 0.80 7.3 6.4 0 0.016 3.80
ENSMUSG00000026104 Stat1 -0.45 7.8 -6.2 0 0.018 3.27
ENSMUSG00000040105 Ppapdc2 -1.69 5.1 -6.4 0 0.016 3.25
ENSMUSG00000092250 Gm20467 -0.75 5.8 -6.1 0 0.018 3.25
ENSMUSG00000022412 Mief1 0.75 6.7 6.0 0 0.018 3.15
ENSMUSG00000003545 Fosb 1.43 5.4 6.1 0 0.018 3.06
ENSMUSG00000025212 Sfxn3 0.73 6.9 5.8 0 0.024 2.63
ENSMUSG00000066877 Nck2 1.29 4.8 6.0 0 0.020 2.61
ENSMUSG00000047181 Samd14 1.26 4.8 5.9 0 0.022 2.46
ENSMUSG00000022377 Asap1 -0.51 7.4 -5.7 0 0.025 2.40
ENSMUSG00000060126 Tpt1 0.95 5.8 5.5 0 0.026 2.23
ENSMUSG00000030870 Ubfd1 0.74 7.4 5.6 0 0.026 2.21
ENSMUSG00000016496 Cd274 0.59 7.5 5.5 0 0.026 2.06
ENSMUSG00000036478 Btg1 0.88 6.4 5.4 0 0.031 2.03
ENSMUSG00000033126 Ybey 1.40 4.5 5.7 0 0.025 1.94
ENSMUSG00000001280 Sp1 0.46 8.2 5.5 0 0.026 1.91
ENSMUSG00000034487 Kdelc2 0.98 6.0 5.3 0 0.035 1.79
ENSMUSG00000000861 Bcl11a 0.50 8.9 5.5 0 0.026 1.78
ENSMUSG00000025793 Hgs 1.02 6.6 5.3 0 0.035 1.70
ENSMUSG00000009905 Kdsr 0.53 7.2 5.3 0 0.035 1.58
ENSMUSG00000024750 Zfand5 0.43 8.4 5.3 0 0.035 1.42
ENSMUSG00000039738 Slx4 -0.92 5.4 -5.1 0 0.044 1.39
ENSMUSG00000038418 Egr1 0.66 7.9 5.2 0 0.036 1.36
ENSMUSG00000021704 Mtx3 0.64 6.4 5.0 0 0.046 1.27
ENSMUSG00000022565 Plec 0.81 7.6 5.1 0 0.042 1.24
ENSMUSG00000052776 Oas1a -0.67 5.3 -5.0 0 0.047 1.23
ENSMUSG00000042198 Chchd7 0.60 6.6 5.0 0 0.046 1.21
ENSMUSG00000069114 Zbtb10 0.93 5.8 4.9 0 0.047 1.13
ENSMUSG00000021098 4930447C04Rik 0.84 5.4 4.9 0 0.047 1.06
ENSMUSG00000047632 Fgfbp3 0.90 5.1 4.9 0 0.047 1.04
ENSMUSG00000050229 Pigm 0.56 6.5 4.9 0 0.047 1.01
ENSMUSG00000000682 Cd52 -0.85 4.5 -4.9 0 0.047 0.98
ENSMUSG00000072244 Trim6 0.92 6.6 4.9 0 0.047 0.93
ENSMUSG00000048249 Crebrf 0.39 8.0 5.0 0 0.046 0.91
ENSMUSG00000022263 Trio 0.85 5.6 4.8 0 0.048 0.90
ENSMUSG00000015488 Cacfd1 1.06 5.3 4.8 0 0.048 0.89
ENSMUSG00000091219 NA 1.27 5.3 4.8 0 0.048 0.88
ENSMUSG00000025134 Alyref 0.59 7.3 4.9 0 0.047 0.86
ENSMUSG00000040483 Xaf1 -0.65 6.5 -4.8 0 0.048 0.83
ENSMUSG00000075602 Ly6a -0.47 6.8 -4.8 0 0.048 0.82
ENSMUSG00000000184 Ccnd2 0.33 8.9 4.9 0 0.047 0.57
ENSMUSG00000011267 Zfp296 -1.61 4.2 -4.9 0 0.047 0.54
ENSMUSG00000025213 Kazald1 5.11 1.5 5.2 0 0.040 -2.42

 

Pik3ap1 is still not significant after multiple testing correction: kable(koHscConcResults %>% filter(mgiSymbol == "Pik3ap1"), digits = 3, format = "html")

 

Design 3.5: expression ~ koStatus + hscPop + ng_per_ul, no quality weighting of samples

Because I’ve accounted for the most obvious source of the variance explained by PC1 - and because ng_per_ul also appears to be correlated with the sample quality weights from voom - I opted to train an additional model without the quality weighting.

# Define the design matrix, including terms corresponding to KO status, HSC
# population, and RNA concentration; use voom to calculate transformed 
# expression values
koHscConcVoomNoQual <- voom(dge, design = koHscDesign, plot = TRUE)

# Fit model for the group design
koHscConcNoQualFit <- lmFit(koHscConcVoomNoQual, koHscConcDesign)
koHscConcNoQualFit <- eBayes(koHscConcNoQualFit)
koHscConcNoQualResults <- topTable(koHscConcNoQualFit, coef = 2, number = nrow(dge))
This new model produces 132 significant genes for koStatus:
ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
ENSMUSG00000025035 Arl3 1.37 5.5 8.8 0.000 0.002 6.965
ENSMUSG00000029287 Tgfbr3 0.72 7.0 6.9 0.000 0.012 4.778
ENSMUSG00000034189 Hsdl1 0.79 7.5 6.7 0.000 0.012 4.440
ENSMUSG00000064215 Ifi27 -0.72 7.3 -6.7 0.000 0.012 4.364
ENSMUSG00000025212 Sfxn3 0.81 6.9 6.6 0.000 0.012 4.219
ENSMUSG00000002413 Braf 0.89 7.3 6.5 0.000 0.012 4.128
ENSMUSG00000036478 Btg1 1.03 6.4 6.4 0.000 0.012 3.901
ENSMUSG00000060126 Tpt1 1.13 5.8 6.4 0.000 0.012 3.845
ENSMUSG00000030870 Ubfd1 0.83 7.4 6.2 0.000 0.014 3.463
ENSMUSG00000001280 Sp1 0.57 8.2 6.1 0.000 0.014 3.292
ENSMUSG00000022412 Mief1 0.83 6.7 6.0 0.000 0.016 3.190
ENSMUSG00000040105 Ppapdc2 -1.60 5.1 -6.2 0.000 0.014 3.025
ENSMUSG00000092250 Gm20467 -0.81 5.8 -5.9 0.000 0.017 3.005
ENSMUSG00000047181 Samd14 1.22 4.8 5.9 0.000 0.017 2.529
ENSMUSG00000025793 Hgs 1.13 6.6 5.6 0.000 0.026 2.486
ENSMUSG00000016496 Cd274 0.67 7.5 5.6 0.000 0.026 2.431
ENSMUSG00000032690 Oas2 -1.38 5.8 -5.5 0.000 0.027 2.319
ENSMUSG00000025134 Alyref 0.70 7.3 5.5 0.000 0.027 2.173
ENSMUSG00000038872 Zfhx3 0.98 6.8 5.4 0.000 0.027 2.099
ENSMUSG00000066877 Nck2 1.26 4.8 5.6 0.000 0.026 2.064
ENSMUSG00000029050 Ski 0.77 7.1 5.4 0.000 0.027 1.953
ENSMUSG00000038418 Egr1 0.68 7.9 5.4 0.000 0.027 1.938
ENSMUSG00000034487 Kdelc2 1.04 6.0 5.3 0.000 0.027 1.928
ENSMUSG00000003545 Fosb 1.43 5.4 5.4 0.000 0.027 1.924
ENSMUSG00000029821 Dfna5 0.65 7.3 5.4 0.000 0.027 1.919
ENSMUSG00000042198 Chchd7 0.68 6.6 5.3 0.000 0.027 1.900
ENSMUSG00000022565 Plec 0.91 7.6 5.4 0.000 0.027 1.869
ENSMUSG00000022637 Cblb 1.10 7.3 5.3 0.000 0.029 1.707
ENSMUSG00000024750 Zfand5 0.51 8.4 5.3 0.000 0.027 1.654
ENSMUSG00000072244 Trim6 0.97 6.6 5.2 0.000 0.032 1.585
ENSMUSG00000041815 Poldip3 0.74 7.7 5.2 0.000 0.030 1.561
ENSMUSG00000009905 Kdsr 0.61 7.2 5.1 0.000 0.034 1.398
ENSMUSG00000000184 Ccnd2 0.39 8.9 5.2 0.000 0.030 1.330
ENSMUSG00000021098 4930447C04Rik 0.95 5.4 5.0 0.000 0.037 1.272
ENSMUSG00000069662 Marcks 0.99 7.0 5.0 0.000 0.037 1.257
ENSMUSG00000075602 Ly6a -0.53 6.8 -5.0 0.000 0.037 1.242
ENSMUSG00000033126 Ybey 1.31 4.5 5.2 0.000 0.030 1.222
ENSMUSG00000023026 Dip2b 0.77 7.8 5.0 0.000 0.037 1.167
ENSMUSG00000087370 Tmem170b 0.77 7.2 5.0 0.000 0.037 1.151
ENSMUSG00000027405 Nop56 -0.51 7.5 -5.0 0.000 0.037 1.142
ENSMUSG00000047632 Fgfbp3 1.00 5.1 4.9 0.000 0.037 1.095
ENSMUSG00000039738 Slx4 -0.92 5.4 -4.9 0.000 0.038 1.073
ENSMUSG00000021712 Trim23 0.81 7.8 5.0 0.000 0.037 1.061
ENSMUSG00000000682 Cd52 -0.97 4.5 -4.9 0.000 0.037 1.017
ENSMUSG00000034158 Lrrc58 0.71 7.8 5.0 0.000 0.037 1.016
ENSMUSG00000091953 NA 1.03 5.1 4.9 0.000 0.038 0.981
ENSMUSG00000050229 Pigm 0.60 6.5 4.8 0.000 0.043 0.863
ENSMUSG00000026434 Nucks1 0.58 8.6 4.9 0.000 0.037 0.829
ENSMUSG00000048154 Kmt2d 1.36 5.6 4.7 0.000 0.044 0.805
ENSMUSG00000038170 Pde4dip 0.96 6.4 4.7 0.000 0.044 0.795
ENSMUSG00000044786 Zfp36 0.40 7.1 4.8 0.000 0.043 0.784
ENSMUSG00000026064 Ptp4a1 1.05 6.2 4.7 0.000 0.044 0.774
ENSMUSG00000037336 Mfsd2b -1.65 5.4 -4.7 0.000 0.044 0.764
ENSMUSG00000059900 Tmem40 -0.59 6.9 -4.8 0.000 0.044 0.757
ENSMUSG00000053094 Tmem248 0.67 7.3 4.8 0.000 0.043 0.746
ENSMUSG00000054408 Spcs3 0.86 8.5 4.9 0.000 0.038 0.731
ENSMUSG00000029438 Bcl7a 1.14 6.6 4.7 0.000 0.044 0.699
ENSMUSG00000034928 Rnf44 0.63 6.4 4.7 0.000 0.044 0.668
ENSMUSG00000026113 Inpp4a 1.02 5.2 4.7 0.000 0.044 0.647
ENSMUSG00000008958 Vps72 0.79 6.7 4.7 0.000 0.044 0.593
ENSMUSG00000043505 Gimap5 0.50 7.4 4.7 0.000 0.044 0.562
ENSMUSG00000020160 Meis1 0.61 9.5 4.9 0.000 0.038 0.537
ENSMUSG00000092060 Bend4 1.04 5.6 4.6 0.000 0.047 0.528
ENSMUSG00000070576 Mn1 1.05 5.2 4.6 0.000 0.047 0.511
ENSMUSG00000026696 Vamp4 0.82 6.7 4.6 0.000 0.046 0.468
ENSMUSG00000062376 2010012O05Rik 0.74 6.9 4.6 0.000 0.046 0.444
ENSMUSG00000026104 Stat1 -0.42 7.8 -4.7 0.000 0.044 0.391
ENSMUSG00000033545 Znrf1 0.46 7.3 4.6 0.000 0.046 0.375
ENSMUSG00000032579 Hemk1 0.77 5.4 4.5 0.000 0.049 0.362
ENSMUSG00000008730 Hipk1 0.55 8.6 4.7 0.000 0.044 0.359
ENSMUSG00000028152 Tspan5 1.04 5.9 4.5 0.000 0.049 0.319
ENSMUSG00000052776 Oas1a -0.75 5.3 -4.5 0.000 0.049 0.278
ENSMUSG00000013593 Ndufs2 -0.51 7.1 -4.5 0.000 0.048 0.275
ENSMUSG00000039126 Prune2 -2.07 4.7 -4.6 0.000 0.045 0.275
ENSMUSG00000069769 Msi2 0.75 8.8 4.7 0.000 0.044 0.273
ENSMUSG00000029647 Pan3 0.52 8.5 4.7 0.000 0.044 0.253
ENSMUSG00000021962 Dcp1a -0.99 5.5 -4.4 0.000 0.049 0.236
ENSMUSG00000003131 Pafah1b2 0.50 8.2 4.6 0.000 0.046 0.229
ENSMUSG00000009470 Tnpo1 0.54 7.8 4.6 0.000 0.047 0.195
ENSMUSG00000025939 Ube2w 0.44 6.9 4.5 0.000 0.049 0.165
ENSMUSG00000031627 Irf2 0.46 7.1 4.5 0.000 0.049 0.113
ENSMUSG00000022263 Trio 0.84 5.6 4.4 0.000 0.049 0.098
ENSMUSG00000026782 Abi2 0.80 8.0 4.5 0.000 0.049 0.087
ENSMUSG00000040433 Zbtb38 0.47 8.3 4.5 0.000 0.048 0.085
ENSMUSG00000026276 Sept2 0.62 7.9 4.5 0.000 0.049 0.085
ENSMUSG00000071072 Ptges3 0.83 6.1 4.4 0.000 0.049 0.075
ENSMUSG00000023147 Wrb 0.84 5.7 4.3 0.000 0.049 0.056
ENSMUSG00000005533 Igf1r 1.48 5.8 4.3 0.000 0.049 0.049
ENSMUSG00000022952 Runx1 0.45 8.2 4.5 0.000 0.049 0.046
ENSMUSG00000007670 Khsrp 1.35 5.4 4.3 0.000 0.049 0.035
ENSMUSG00000079419 Ms4a6c -0.84 5.8 -4.3 0.001 0.049 0.022
ENSMUSG00000069114 Zbtb10 1.08 5.8 4.3 0.001 0.049 0.013
ENSMUSG00000025017 Pik3ap1 1.01 5.9 4.3 0.000 0.049 0.001
ENSMUSG00000034300 Fam53c 1.08 6.6 4.4 0.000 0.049 -0.002
ENSMUSG00000029322 Plac8 -0.83 6.2 -4.3 0.001 0.049 -0.004
ENSMUSG00000032035 Ets1 1.29 5.9 4.3 0.001 0.049 -0.005
ENSMUSG00000020176 Grb10 0.54 6.8 4.4 0.000 0.049 -0.010
ENSMUSG00000091219 NA 1.24 5.3 4.3 0.001 0.049 -0.011
ENSMUSG00000022377 Asap1 -0.47 7.4 -4.4 0.000 0.049 -0.015
ENSMUSG00000081594 Gm15467 1.00 4.9 4.3 0.001 0.049 -0.039
ENSMUSG00000062078 Qk 0.63 8.3 4.5 0.000 0.049 -0.042
ENSMUSG00000033713 Foxn3 0.44 7.4 4.4 0.000 0.049 -0.048
ENSMUSG00000033499 Larp4b 0.61 7.4 4.4 0.000 0.049 -0.057
ENSMUSG00000038212 Hiatl1 0.50 7.2 4.4 0.000 0.049 -0.058
ENSMUSG00000052713 Zfp608 0.30 8.2 4.5 0.000 0.049 -0.060
ENSMUSG00000048249 Crebrf 0.38 8.0 4.4 0.000 0.049 -0.094
ENSMUSG00000075376 Rc3h2 0.74 8.5 4.5 0.000 0.049 -0.117
ENSMUSG00000028521 Slc35d1 0.78 7.5 4.4 0.000 0.049 -0.123
ENSMUSG00000063410 Stk24 0.62 6.3 4.3 0.001 0.050 -0.132
ENSMUSG00000060510 Zfp266 0.73 7.8 4.4 0.000 0.049 -0.143
ENSMUSG00000021488 Nsd1 0.55 8.6 4.5 0.000 0.049 -0.154
ENSMUSG00000062646 Ganc 0.64 7.0 4.3 0.000 0.049 -0.154
ENSMUSG00000001794 Capns1 0.45 7.7 4.4 0.000 0.049 -0.160
ENSMUSG00000045362 Tnfrsf26 0.84 7.1 4.3 0.001 0.049 -0.178
ENSMUSG00000066551 Hmgb1 0.55 7.1 4.3 0.001 0.049 -0.246
ENSMUSG00000021134 Srsf5 0.50 8.3 4.4 0.000 0.049 -0.247
ENSMUSG00000056999 Ide 0.59 7.6 4.3 0.000 0.049 -0.263
ENSMUSG00000062822 4833420G17Rik 0.39 7.7 4.3 0.000 0.049 -0.276
ENSMUSG00000020962 Gtf2a1 0.81 7.2 4.3 0.001 0.049 -0.288
ENSMUSG00000048379 Socs4 0.56 7.8 4.3 0.000 0.049 -0.295
ENSMUSG00000004880 Lbr 0.52 8.1 4.3 0.000 0.049 -0.310
ENSMUSG00000000861 Bcl11a 0.51 8.9 4.4 0.000 0.049 -0.368
ENSMUSG00000020015 Cdk17 0.45 8.6 4.4 0.000 0.049 -0.371
ENSMUSG00000003949 Hlf 0.38 9.2 4.4 0.000 0.049 -0.388
ENSMUSG00000025935 Tram1 0.44 8.4 4.3 0.001 0.049 -0.432
ENSMUSG00000069056 NA 1.65 4.0 4.3 0.001 0.049 -0.517
ENSMUSG00000037965 Zc3h7a 0.41 8.5 4.3 0.001 0.049 -0.520
ENSMUSG00000026034 Clk1 0.53 9.1 4.3 0.001 0.049 -0.623
ENSMUSG00000032372 Plscr2 -4.38 2.6 -4.5 0.000 0.049 -2.125
ENSMUSG00000001943 Vsig2 -2.67 1.9 -4.4 0.000 0.049 -2.213
ENSMUSG00000025213 Kazald1 4.78 1.5 4.8 0.000 0.044 -2.510
ENSMUSG00000064202 4430402I18Rik -3.62 1.4 -4.5 0.000 0.049 -2.518

 

Pik3ap1 is borderline significant after multiple testing correction:
ensemblID mgiSymbol logFC AveExpr t P.Value adj.P.Val B
ENSMUSG00000025017 Pik3ap1 1 5.9 4.3 0 0.049 0.001

While, for the reasons stated above, differential BCAP expression should probably not be viewed as a positive control, it’s at least somewhat encouraging to see the model better account for this difference.

 

3.2 Model (design) comparison

The plot below shows the full list of genes that were found to be significant with one or more of the above model designs. For each gene, a colored bar indicates the design under which it was found to be significant (the size of the bar is scaled by the adjusted p-value).

The gene lists for designs 1 and 2 overlap completely with designs 3 and 3.5; there were 4 genes from design 3 that were not significant in design 3.5.

Genes with significantly different expression (adj. p-value < 0.05) as a function of BCAP KO status:


4 Examining HSC populations separately

Just to check, I also tested for differential expression with only libraries from the same HSC population included. For both long- and short-term HSCs, there were no significant genes for koStatus.

4.1 Long-term HSC mice only

# Get data for libraries from long-term HSC population
groupDatLong <- groupDat %>% 
    filter(hscPop == "long")
countDatLong <- countDat[, names(countDat) %in% groupDatLong$libID]

# Filter genes with (cpm > 10) in < 10% of samples
dgeLong = setUpDGEList(countData = countDatLong, geneData = geneDat,
                      filterCount = 10, 
                      filterPercentage = 0.20)
## keepRows
## FALSE  TRUE 
## 26152 11839

 

Design: expression ~ koStatus

# Define the design matrix, only including terms corresponding to KO status;
# use voom to calculate transformed expression values
koDesignLong <- model.matrix(~ koStatus, data = groupDatLong)
koVoomLong <- voomWithQualityWeights(dgeLong, design = koDesignLong,
                                    plot = TRUE)

# Fit model for the group design
koFitLong <- lmFit(koVoomLong, koDesignLong)
koFitLong <- eBayes(koFitLong)
koResultsLong <- topTable(koFitLong, number = nrow(dgeLong))

This model produces 0 significant genes for koStatus.

 

4.2 Short-term HSC mice only

# Get data for libraries from long-term HSC population
groupDatShort <- groupDat %>% 
    filter(hscPop == "short")
countDatShort <- countDat[, names(countDat) %in% groupDatShort$libID]

# Filter genes with (cpm > 10) in < 10% of samples
dgeShort = setUpDGEList(countData = countDatShort, geneData = geneDat,
                      filterCount = 10, 
                      filterPercentage = 0.20)
## keepRows
## FALSE  TRUE 
## 26097 11894

 

Design: expression ~ koStatus

# Define the design matrix, only including terms corresponding to KO status;
# use voom to calculate transformed expression values
koDesignShort <- model.matrix(~ koStatus, data = groupDatShort)
koVoomShort <- voomWithQualityWeights(dgeShort, design = koDesignShort,
                                    plot = TRUE)

# Fit model for the group design
koFitShort <- lmFit(koVoomShort, koDesignShort)
koFitShort <- eBayes(koFitShort)
koResultsShort <- topTable(koFitShort, number = nrow(dgeShort))

This model produces 0 significant genes for koStatus.


5 Session info

sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.3 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggfortify_0.0.3   proto_0.3-10      shiny_0.12.2     
##  [4] knitr_1.11        reshape2_1.4.1    readxl_0.1.0.9000
##  [7] readr_0.1.1       stringr_1.0.0     ggthemes_2.2.1   
## [10] dplyr_0.4.2       sva_3.14.0        genefilter_1.50.0
## [13] mgcv_1.8-7        nlme_3.1-121      cowplot_0.5.0    
## [16] ggbiplot_0.55     scales_0.2.5      plyr_1.8.3       
## [19] ggplot2_1.0.1     biomaRt_2.24.0    edgeR_3.10.2     
## [22] limma_3.24.15    
## 
## loaded via a namespace (and not attached):
##  [1] splines_3.2.1        lattice_0.20-33      colorspace_1.2-6    
##  [4] htmltools_0.2.6      stats4_3.2.1         yaml_2.1.13         
##  [7] XML_3.98-1.3         survival_2.38-3      DBI_0.3.1           
## [10] BiocGenerics_0.14.0  munsell_0.4.2        gtable_0.1.2        
## [13] codetools_0.2-14     evaluate_0.7.2       labeling_0.3        
## [16] Biobase_2.28.0       IRanges_2.2.7        httpuv_1.3.3        
## [19] GenomeInfoDb_1.4.2   parallel_3.2.1       AnnotationDbi_1.30.1
## [22] highr_0.5            Rcpp_0.12.0          xtable_1.7-4        
## [25] formatR_1.2          S4Vectors_0.6.3      annotate_1.46.1     
## [28] mime_0.3             gridExtra_2.0.0      digest_0.6.8        
## [31] stringi_0.5-5        tools_3.2.1          bitops_1.0-6        
## [34] magrittr_1.5         RCurl_1.95-4.7       lazyeval_0.1.10     
## [37] RSQLite_1.0.0        tidyr_0.2.0          MASS_7.3-43         
## [40] Matrix_1.2-2         assertthat_0.1       rmarkdown_0.7       
## [43] R6_2.1.1